Skip to content

Conversation

majanjua-amzn
Copy link

@majanjua-amzn majanjua-amzn commented Aug 20, 2025

Description:
Adding support for X-Ray's new Adaptive Sampling feature. The feature will allow users to detect anomalies in their application across distributed services and boost the sampling rate of their root service based on their X-Ray sampling rule configuration and optionally a user-provided SDK level configuration. It will also allow users to optionally provide an error capture configuration, where - if configured - the sampler will send unsampled anomaly spans to be exported directly:

  • Configure a max boost rate and boost cooldown in their X-Ray sampling rules that defines how high the sampling rate can go for a given rule based on anomalies detected in the instrumentation and for how long
  • Configure anomaly conditions based on error code (RegEx), latency, and span operation - these will be used to check if a given span is an anomaly. By default, spans with a statusCode > 499 (i.e. 5XX) will be considered anomalies
  • Configure a anomaly/error capture rate that allows spans to be sent directly to the configured span exporter.

The changes:

  • Provide APIs called setSpanExporter and setAdaptiveSamplingConfig to set up the feature - if these are not provided any attempt to use the adaptSampling API will throw an IllegalStateException
  • Provide an API called adaptSampling that accepts a span and its associated spanData:
    • This API performs the necessary logic to determine whether the span is an anomaly based on the user-provided conditions (or default 5XX) and makes the decision whether to count it towards the boost-related statistics and/or whether to send it directly for export based on the error capture rate
  • Update the calls to GetSamplingRules and GetSamplingTargets according to the new API and the collected anomaly statistics
  • Propagate sampling information between instrumented services - specifically, the sampling rule in the root service is passed to all downstream services via the trace state, such that statistics can be recorded meaningfully for boost to be triggered for the root service in a distributed system

Testing:
Re-testing needs to be done based on latest changes in aws-otel-java-instrumentation once ready, hence the need for the draft revision of the PR

  • Manual testing: Application launched on fresh EC2 instance with 3 services
  • Performance testing: Revealed no difference between

Documentation:

  • Coming soon

Outstanding items:
These changes are the first phase and iteration of AWS X-Ray's adaptive sampling feature. As we get feedback, more changes may be introduced to improve or streamline the experience.

Copy link

linux-foundation-easycla bot commented Aug 20, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: majanjua-amzn / name: Mahad Janjua (ba183a5)

@majanjua-amzn majanjua-amzn reopened this Aug 29, 2025
@majanjua-amzn majanjua-amzn marked this pull request as ready for review August 29, 2025 21:47
@majanjua-amzn majanjua-amzn requested a review from a team as a code owner August 29, 2025 21:47
@majanjua-amzn majanjua-amzn marked this pull request as draft August 29, 2025 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants